progect description

Let's imagine that we work in a startup that sells food products and our task is toinvestigate user behavior for the company's app:

  • study the sales funnel,
  • look at the results of an A/A/B test,
  • formulate statistical hypotheses.

The conclusions that I will draw from it will help me improve the company conversion by interpreting user behavior and clarifying the results of statistical tests.

Table of Contents

Step 1. Downloading the data

We will use 9 libraries:

  • pandas: for data processing
  • numpy, math: for calculations
  • plotly express: for data visualisation
  • datetime: for working with data
  • scipy for hypotheses testing
  • sys, warnings: for not showing the warnings
  • iterals: for nice combinations
In [113]:
import pandas as pd
import numpy as np
import datetime
from datetime import timedelta
from datetime import datetime
import plotly.express as px
import plotly.graph_objects as go
import math as mth
from scipy import stats as st
import re
import itertools
from plotly.offline import iplot, init_notebook_mode
import matplotlib.pyplot as plt
import matplotlib as mpl
import sys
import warnings
if not sys.warnoptions:
    warnings.simplefilter("ignore")
import seaborn as sns
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
plt.style.use('fivethirtyeight')

Let's set some parameters for ploting

In [114]:
mpl.rcParams['lines.linewidth'] = 2
mpl.rcParams["figure.figsize"] = [8, 6]
mpl.rcParams.update({"axes.grid": True, "grid.color": "grey"})
mpl.rcParams['image.cmap'] = 'gray'
mpl.rcParams['figure.dpi'] = 80
mpl.rcParams['savefig.dpi'] = 100
mpl.rcParams['font.size'] = 12
mpl.rcParams['legend.fontsize'] = 'large'
mpl.rcParams['figure.titlesize'] = 'medium'
In [115]:
try:
    logs_exp = pd.read_csv('/datasets/logs_exp_us.csv', sep='\t', dtype={'EventName': 'category',
                                                                         'ExpId': 'category'})  # practicum path
except:
    try:
        logs_exp = pd.read_csv('./datasets/logs_exp_us.csv', sep='\t', dtype={'EventName': 'category',
                                                                              'ExpId': 'category'})  # local path
    except:
        try:
            logs_exp = pd.read_csv('https://code.s3.yandex.net//datasets/logs_exp_us.csv', sep='\t', dtype={'EventName': 'category',
                                                                                                            'ExpId': 'category'})  # loading path
        except FileNotFoundError:
            print('Ooops, the dateset not found.')

        except pd.errors.EmptyDataError:
            print('Ooops, the dataset is empty.')

Let's downcast our data so it wouldn't take to much space

In [116]:
logs_exp['DeviceIDHash'] = pd.to_numeric(logs_exp['DeviceIDHash'], downcast='integer')
logs_exp['EventTimestamp'] = pd.to_numeric(logs_exp['EventTimestamp'], downcast='integer')

logs_exp.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244126 entries, 0 to 244125
Data columns (total 4 columns):
 #   Column          Non-Null Count   Dtype   
---  ------          --------------   -----   
 0   EventName       244126 non-null  category
 1   DeviceIDHash    244126 non-null  int64   
 2   EventTimestamp  244126 non-null  int32   
 3   ExpId           244126 non-null  category
dtypes: category(2), int32(1), int64(1)
memory usage: 3.3 MB

Conclusion

We successfully opened the dataset. The dataset contains 244126 lines, 2 category columns, 2 integer columns. Let's se how we can preprocess it

Step 2. Preprocessing the data

Renaming the columns

In [117]:
logs_exp.columns = ['event_name', 'user_id', 'timestamp', 'experiment_id']
In [118]:
logs_exp.user_id.nunique()
Out[118]:
7551

Checking for missing values and data types

In [119]:
logs_exp.describe(include='all')
Out[119]:
event_name user_id timestamp experiment_id
count 244126 2.441260e+05 2.441260e+05 244126
unique 5 NaN NaN 3
top MainScreenAppear NaN NaN 248
freq 119205 NaN NaN 85747
mean NaN 4.627568e+18 1.564914e+09 NaN
std NaN 2.642425e+18 1.771343e+05 NaN
min NaN 6.888747e+15 1.564030e+09 NaN
25% NaN 2.372212e+18 1.564757e+09 NaN
50% NaN 4.623192e+18 1.564919e+09 NaN
75% NaN 6.932517e+18 1.565075e+09 NaN
max NaN 9.222603e+18 1.565213e+09 NaN
In [120]:
round(logs_exp.experiment_id.value_counts(normalize=True) * 100,2)
Out[120]:
248    35.12
246    32.89
247    31.98
Name: experiment_id, dtype: float64

Duplicates

In [121]:
for i in logs_exp[logs_exp.duplicated()].columns:
    print(i, ':', logs_exp[logs_exp.duplicated()][i].nunique())
event_name : 5
user_id : 237
timestamp : 352
experiment_id : 3
In [122]:
print(f'procentage of duplicates is {round(logs_exp.duplicated().sum() / logs_exp.shape[0] * 100,2)}%')
procentage of duplicates is 0.17%
In [123]:
logs_exp = logs_exp.drop_duplicates()

date and time column + dates column

In [124]:
logs_exp['timestamp'] = logs_exp.timestamp.apply(lambda x:datetime.fromtimestamp(x))
logs_exp['date'] = logs_exp['timestamp'].astype('datetime64[D]')

Conclusion

We found really small amount of duplicated lines, in real worlв we should report this, cause it means, something is wrong obtained data. We found no missing values, but created the 'timestamp' and 'date' columns which will help us later on. Also we renamed column names to officially acceped naming format. Also the proportions for experiments look equal

Step 3. Data Discovery

How many events are in the logs?

In [125]:
print(f'we have {logs_exp.event_name.nunique()} unique events and {logs_exp.shape[0]} events in general in the logs dataset')
we have 5 unique events and 243713 events in general in the logs dataset

How many users are in the logs?

In [126]:
logs_exp.user_id.nunique()
Out[126]:
7551

What's the average number of events per user?

In [127]:
round(logs_exp.groupby('user_id')['event_name'].count().mean(),2)
Out[127]:
32.28

What period of time does the data cover? Find the maximum and the minimum date.

In [128]:
print(f"The research period is from {logs_exp.date.min()} to {logs_exp.date.max()} covering {(logs_exp.date.max() - logs_exp.date.min())/ np.timedelta64(1, 'D') + 1} days")
The research period is from 2019-07-25 00:00:00 to 2019-08-08 00:00:00 covering 15.0 days

Date and time histogram

In [129]:
fig = px.histogram(logs_exp, x="timestamp", title='amount of events distribution')

fig.show()

We can see really clear that August data has it's own pattern, but I want to get sure that the 1st August also belong to the pattern model. Also I want to check the hour events distribution, because I beleive that this falls happen due to night hour lower visitor flow and peaks - due to daytime visitors. What wa the reason for having such low values before August - possibly technical problems.

In [130]:
logs_exp['hour'] = logs_exp.timestamp.dt.round('H')
In [131]:
logs_exp['only_hour'] = logs_exp['hour'].dt.hour
In [132]:
fig = px.histogram(logs_exp, x="only_hour", title='total amount of events per hour')
fig.show()

As I expected, the night hours have the lowest amount of events. But I want to get to the mean amountnt of events when the system recieved data correctly, not the whole sum of events.

In [133]:
min_amount= 81
normal_hours = logs_exp.groupby(['date','only_hour'])['event_name'].count().unstack()
event_per_hour = pd.DataFrame(normal_hours[normal_hours > min_amount].mean(axis=0))
event_per_hour.columns = ['amount']
In [134]:
fig = px.histogram(event_per_hour, x=event_per_hour.index, y='amount', nbins=24,
                   labels={'amount':'average events per hour'}, # can specify one label per df column
                   opacity=0.8,
                   color_discrete_sequence=['indianred'], title='mean amount of events per hour') # color of histogram bars

fig.show()

Now I have awerage amount of events for eventsery hour and can compare it to the recieved values.

In [135]:
per_day_per_hour = logs_exp.groupby(['date','only_hour'])['event_name'].count().reset_index(drop=False)

per_day_per_hour = per_day_per_hour.merge(
    event_per_hour, how='left', left_on='only_hour', right_on=event_per_hour.index)

per_day_per_hour.columns = ['date', 'hour', 'events_num', 'mean_events_num']
per_day_per_hour['diff'] = per_day_per_hour['mean_events_num'] -per_day_per_hour['events_num']

per_day_per_hour
Out[135]:
date hour events_num mean_events_num diff
0 2019-07-25 0 1 804.285714 803.285714
1 2019-07-25 8 1 996.857143 995.857143
2 2019-07-25 14 3 2140.714286 2137.714286
3 2019-07-25 15 2 2199.000000 2197.000000
4 2019-07-25 18 1 2227.500000 2226.500000
5 2019-07-25 23 1 1308.571429 1307.571429
6 2019-07-26 7 1 722.857143 721.857143
7 2019-07-26 9 1 1384.000000 1383.000000
8 2019-07-26 10 3 1514.142857 1511.142857
9 2019-07-26 13 1 1881.625000 1880.625000
10 2019-07-26 15 2 2199.000000 2197.000000
11 2019-07-26 16 1 2220.857143 2219.857143
12 2019-07-26 17 6 2129.375000 2123.375000
13 2019-07-26 19 2 2227.750000 2225.750000
14 2019-07-26 20 3 2017.375000 2014.375000
15 2019-07-26 21 4 1777.000000 1773.000000
16 2019-07-26 22 2 1514.250000 1512.250000
17 2019-07-26 23 4 1308.571429 1304.571429
18 2019-07-27 3 1 181.428571 180.428571
19 2019-07-27 7 1 722.857143 721.857143
20 2019-07-27 10 4 1514.142857 1510.142857
21 2019-07-27 11 3 1740.285714 1737.285714
22 2019-07-27 12 6 1968.285714 1962.285714
23 2019-07-27 13 10 1881.625000 1871.625000
24 2019-07-27 14 4 2140.714286 2136.714286
25 2019-07-27 15 1 2199.000000 2198.000000
26 2019-07-27 16 2 2220.857143 2218.857143
27 2019-07-27 17 2 2129.375000 2127.375000
28 2019-07-27 18 2 2227.500000 2225.500000
29 2019-07-27 19 4 2227.750000 2223.750000
30 2019-07-27 20 4 2017.375000 2013.375000
31 2019-07-27 21 8 1777.000000 1769.000000
32 2019-07-27 22 3 1514.250000 1511.250000
33 2019-07-28 3 1 181.428571 180.428571
34 2019-07-28 5 1 307.428571 306.428571
35 2019-07-28 8 11 996.857143 985.857143
36 2019-07-28 10 5 1514.142857 1509.142857
37 2019-07-28 11 6 1740.285714 1734.285714
38 2019-07-28 12 5 1968.285714 1963.285714
39 2019-07-28 13 4 1881.625000 1877.625000
40 2019-07-28 15 3 2199.000000 2196.000000
41 2019-07-28 16 20 2220.857143 2200.857143
42 2019-07-28 17 5 2129.375000 2124.375000
43 2019-07-28 18 7 2227.500000 2220.500000
44 2019-07-28 19 9 2227.750000 2218.750000
45 2019-07-28 20 7 2017.375000 2010.375000
46 2019-07-28 21 12 1777.000000 1765.000000
47 2019-07-28 22 3 1514.250000 1511.250000
48 2019-07-28 23 5 1308.571429 1303.571429
49 2019-07-29 0 4 804.285714 800.285714
50 2019-07-29 5 2 307.428571 305.428571
51 2019-07-29 7 1 722.857143 721.857143
52 2019-07-29 8 5 996.857143 991.857143
53 2019-07-29 9 1 1384.000000 1383.000000
54 2019-07-29 10 1 1514.142857 1513.142857
55 2019-07-29 11 5 1740.285714 1735.285714
56 2019-07-29 12 7 1968.285714 1961.285714
57 2019-07-29 13 11 1881.625000 1870.625000
58 2019-07-29 14 9 2140.714286 2131.714286
59 2019-07-29 15 6 2199.000000 2193.000000
60 2019-07-29 16 9 2220.857143 2211.857143
61 2019-07-29 17 20 2129.375000 2109.375000
62 2019-07-29 18 20 2227.500000 2207.500000
63 2019-07-29 19 10 2227.750000 2217.750000
64 2019-07-29 20 30 2017.375000 1987.375000
65 2019-07-29 21 20 1777.000000 1757.000000
66 2019-07-29 22 11 1514.250000 1503.250000
67 2019-07-29 23 9 1308.571429 1299.571429
68 2019-07-30 0 9 804.285714 795.285714
69 2019-07-30 1 2 413.571429 411.571429
70 2019-07-30 4 1 251.428571 250.428571
71 2019-07-30 5 1 307.428571 306.428571
72 2019-07-30 6 3 532.428571 529.428571
73 2019-07-30 7 9 722.857143 713.857143
74 2019-07-30 8 8 996.857143 988.857143
75 2019-07-30 9 17 1384.000000 1367.000000
76 2019-07-30 10 13 1514.142857 1501.142857
77 2019-07-30 11 19 1740.285714 1721.285714
78 2019-07-30 12 19 1968.285714 1949.285714
79 2019-07-30 13 22 1881.625000 1859.625000
80 2019-07-30 14 34 2140.714286 2106.714286
81 2019-07-30 15 33 2199.000000 2166.000000
82 2019-07-30 16 21 2220.857143 2199.857143
83 2019-07-30 17 23 2129.375000 2106.375000
84 2019-07-30 18 29 2227.500000 2198.500000
85 2019-07-30 19 39 2227.750000 2188.750000
86 2019-07-30 20 33 2017.375000 1984.375000
87 2019-07-30 21 24 1777.000000 1753.000000
88 2019-07-30 22 19 1514.250000 1495.250000
89 2019-07-30 23 19 1308.571429 1289.571429
90 2019-07-31 0 28 804.285714 776.285714
91 2019-07-31 1 9 413.571429 404.571429
92 2019-07-31 2 5 194.571429 189.571429
93 2019-07-31 3 3 181.428571 178.428571
94 2019-07-31 4 2 251.428571 249.428571
95 2019-07-31 5 2 307.428571 305.428571
96 2019-07-31 6 7 532.428571 525.428571
97 2019-07-31 7 18 722.857143 704.857143
98 2019-07-31 8 22 996.857143 974.857143
99 2019-07-31 9 30 1384.000000 1354.000000
100 2019-07-31 10 38 1514.142857 1476.142857
101 2019-07-31 11 38 1740.285714 1702.285714
102 2019-07-31 12 38 1968.285714 1930.285714
103 2019-07-31 13 83 1881.625000 1798.625000
104 2019-07-31 14 64 2140.714286 2076.714286
105 2019-07-31 15 66 2199.000000 2133.000000
106 2019-07-31 16 80 2220.857143 2140.857143
107 2019-07-31 17 95 2129.375000 2034.375000
108 2019-07-31 18 115 2227.500000 2112.500000
109 2019-07-31 19 106 2227.750000 2121.750000
110 2019-07-31 20 121 2017.375000 1896.375000
111 2019-07-31 21 96 1777.000000 1681.000000
112 2019-07-31 22 92 1514.250000 1422.250000
113 2019-07-31 23 55 1308.571429 1253.571429
114 2019-08-01 0 541 804.285714 263.285714
115 2019-08-01 1 393 413.571429 20.571429
116 2019-08-01 2 178 194.571429 16.571429
117 2019-08-01 3 115 181.428571 66.428571
118 2019-08-01 4 213 251.428571 38.428571
119 2019-08-01 5 285 307.428571 22.428571
120 2019-08-01 6 394 532.428571 138.428571
121 2019-08-01 7 779 722.857143 -56.142857
122 2019-08-01 8 946 996.857143 50.857143
123 2019-08-01 9 1371 1384.000000 13.000000
124 2019-08-01 10 1484 1514.142857 30.142857
125 2019-08-01 11 1878 1740.285714 -137.714286
126 2019-08-01 12 1979 1968.285714 -10.714286
127 2019-08-01 13 2064 1881.625000 -182.375000
128 2019-08-01 14 2188 2140.714286 -47.285714
129 2019-08-01 15 2573 2199.000000 -374.000000
130 2019-08-01 16 2327 2220.857143 -106.142857
131 2019-08-01 17 2647 2129.375000 -517.625000
132 2019-08-01 18 2457 2227.500000 -229.500000
133 2019-08-01 19 2698 2227.750000 -470.250000
134 2019-08-01 20 2406 2017.375000 -388.625000
135 2019-08-01 21 2355 1777.000000 -578.000000
136 2019-08-01 22 2023 1514.250000 -508.750000
137 2019-08-01 23 1473 1308.571429 -164.428571
138 2019-08-02 0 1261 804.285714 -456.714286
139 2019-08-02 1 351 413.571429 62.571429
140 2019-08-02 2 274 194.571429 -79.428571
141 2019-08-02 3 385 181.428571 -203.571429
142 2019-08-02 4 782 251.428571 -530.571429
143 2019-08-02 5 520 307.428571 -212.571429
144 2019-08-02 6 1043 532.428571 -510.571429
145 2019-08-02 7 728 722.857143 -5.142857
146 2019-08-02 8 929 996.857143 67.857143
147 2019-08-02 9 1089 1384.000000 295.000000
148 2019-08-02 10 1624 1514.142857 -109.857143
149 2019-08-02 11 1846 1740.285714 -105.714286
150 2019-08-02 12 2042 1968.285714 -73.714286
151 2019-08-02 13 2178 1881.625000 -296.375000
152 2019-08-02 14 2038 2140.714286 102.714286
153 2019-08-02 15 2144 2199.000000 55.000000
154 2019-08-02 16 2066 2220.857143 154.857143
155 2019-08-02 17 2324 2129.375000 -194.625000
156 2019-08-02 18 2566 2227.500000 -338.500000
157 2019-08-02 19 2392 2227.750000 -164.250000
158 2019-08-02 20 2016 2017.375000 1.375000
159 2019-08-02 21 1876 1777.000000 -99.000000
160 2019-08-02 22 1727 1514.250000 -212.750000
161 2019-08-02 23 1370 1308.571429 -61.428571
162 2019-08-03 0 823 804.285714 -18.714286
163 2019-08-03 1 364 413.571429 49.571429
164 2019-08-03 2 197 194.571429 -2.428571
165 2019-08-03 3 140 181.428571 41.428571
166 2019-08-03 4 125 251.428571 126.428571
167 2019-08-03 5 372 307.428571 -64.571429
168 2019-08-03 6 408 532.428571 124.428571
169 2019-08-03 7 518 722.857143 204.857143
170 2019-08-03 8 757 996.857143 239.857143
171 2019-08-03 9 1257 1384.000000 127.000000
172 2019-08-03 10 1424 1514.142857 90.142857
173 2019-08-03 11 1925 1740.285714 -184.714286
174 2019-08-03 12 2074 1968.285714 -105.714286
175 2019-08-03 13 2393 1881.625000 -511.375000
176 2019-08-03 14 2443 2140.714286 -302.285714
177 2019-08-03 15 2537 2199.000000 -338.000000
178 2019-08-03 16 2136 2220.857143 84.857143
179 2019-08-03 17 2320 2129.375000 -190.625000
180 2019-08-03 18 2522 2227.500000 -294.500000
181 2019-08-03 19 2060 2227.750000 167.750000
182 2019-08-03 20 2236 2017.375000 -218.625000
183 2019-08-03 21 1737 1777.000000 40.000000
184 2019-08-03 22 1768 1514.250000 -253.750000
185 2019-08-03 23 1066 1308.571429 242.571429
186 2019-08-04 0 591 804.285714 213.285714
187 2019-08-04 1 446 413.571429 -32.428571
188 2019-08-04 2 122 194.571429 72.571429
189 2019-08-04 3 133 181.428571 48.428571
190 2019-08-04 4 156 251.428571 95.428571
191 2019-08-04 5 219 307.428571 88.428571
192 2019-08-04 6 341 532.428571 191.428571
193 2019-08-04 7 561 722.857143 161.857143
194 2019-08-04 8 869 996.857143 127.857143
195 2019-08-04 9 1177 1384.000000 207.000000
196 2019-08-04 10 1460 1514.142857 54.142857
197 2019-08-04 11 1732 1740.285714 8.285714
198 2019-08-04 12 2019 1968.285714 -50.714286
199 2019-08-04 13 2159 1881.625000 -277.375000
200 2019-08-04 14 2473 2140.714286 -332.285714
201 2019-08-04 15 2294 2199.000000 -95.000000
202 2019-08-04 16 2446 2220.857143 -225.142857
203 2019-08-04 17 2011 2129.375000 118.375000
204 2019-08-04 18 2152 2227.500000 75.500000
205 2019-08-04 19 2421 2227.750000 -193.250000
206 2019-08-04 20 2094 2017.375000 -76.625000
207 2019-08-04 21 2054 1777.000000 -277.000000
208 2019-08-04 22 1398 1514.250000 116.250000
209 2019-08-04 23 1255 1308.571429 53.571429
210 2019-08-05 0 832 804.285714 -27.714286
211 2019-08-05 1 578 413.571429 -164.428571
212 2019-08-05 2 237 194.571429 -42.428571
213 2019-08-05 3 179 181.428571 2.428571
214 2019-08-05 4 153 251.428571 98.428571
215 2019-08-05 5 237 307.428571 70.428571
216 2019-08-05 6 475 532.428571 57.428571
217 2019-08-05 7 589 722.857143 133.857143
218 2019-08-05 8 1089 996.857143 -92.142857
219 2019-08-05 9 1463 1384.000000 -79.000000
220 2019-08-05 10 1516 1514.142857 -1.857143
221 2019-08-05 11 1608 1740.285714 132.285714
222 2019-08-05 12 2008 1968.285714 -39.714286
223 2019-08-05 13 2322 1881.625000 -440.375000
224 2019-08-05 14 2142 2140.714286 -1.285714
225 2019-08-05 15 2108 2199.000000 91.000000
226 2019-08-05 16 2516 2220.857143 -295.142857
227 2019-08-05 17 2673 2129.375000 -543.625000
228 2019-08-05 18 2915 2227.500000 -687.500000
229 2019-08-05 19 2856 2227.750000 -628.250000
230 2019-08-05 20 2597 2017.375000 -579.625000
231 2019-08-05 21 2130 1777.000000 -353.000000
232 2019-08-05 22 1500 1514.250000 14.250000
233 2019-08-05 23 1352 1308.571429 -43.428571
234 2019-08-06 0 990 804.285714 -185.714286
235 2019-08-06 1 421 413.571429 -7.428571
236 2019-08-06 2 235 194.571429 -40.428571
237 2019-08-06 3 199 181.428571 -17.571429
238 2019-08-06 4 188 251.428571 63.428571
239 2019-08-06 5 237 307.428571 70.428571
240 2019-08-06 6 597 532.428571 -64.571429
241 2019-08-06 7 836 722.857143 -113.142857
242 2019-08-06 8 1388 996.857143 -391.142857
243 2019-08-06 9 1975 1384.000000 -591.000000
244 2019-08-06 10 1719 1514.142857 -204.857143
245 2019-08-06 11 1695 1740.285714 45.285714
246 2019-08-06 12 1966 1968.285714 2.285714
247 2019-08-06 13 1933 1881.625000 -51.375000
248 2019-08-06 14 1942 2140.714286 198.714286
249 2019-08-06 15 1895 2199.000000 304.000000
250 2019-08-06 16 1987 2220.857143 233.857143
251 2019-08-06 17 2806 2129.375000 -676.625000
252 2019-08-06 18 2442 2227.500000 -214.500000
253 2019-08-06 19 2755 2227.750000 -527.250000
254 2019-08-06 20 2468 2017.375000 -450.625000
255 2019-08-06 21 2406 1777.000000 -629.000000
256 2019-08-06 22 1803 1514.250000 -288.750000
257 2019-08-06 23 1387 1308.571429 -78.428571
258 2019-08-07 0 592 804.285714 212.285714
259 2019-08-07 1 342 413.571429 71.571429
260 2019-08-07 2 119 194.571429 75.571429
261 2019-08-07 3 119 181.428571 62.428571
262 2019-08-07 4 143 251.428571 108.428571
263 2019-08-07 5 282 307.428571 25.428571
264 2019-08-07 6 469 532.428571 63.428571
265 2019-08-07 7 1049 722.857143 -326.142857
266 2019-08-07 8 1000 996.857143 -3.142857
267 2019-08-07 9 1356 1384.000000 28.000000
268 2019-08-07 10 1372 1514.142857 142.142857
269 2019-08-07 11 1498 1740.285714 242.285714
270 2019-08-07 12 1690 1968.285714 278.285714
271 2019-08-07 13 1921 1881.625000 -39.375000
272 2019-08-07 14 1759 2140.714286 381.714286
273 2019-08-07 15 1842 2199.000000 357.000000
274 2019-08-07 16 2068 2220.857143 152.857143
275 2019-08-07 17 2159 2129.375000 -29.625000
276 2019-08-07 18 2651 2227.500000 -423.500000
277 2019-08-07 19 2534 2227.750000 -306.250000
278 2019-08-07 20 2201 2017.375000 -183.625000
279 2019-08-07 21 1562 1777.000000 215.000000
280 2019-08-07 22 1803 1514.250000 -288.750000
281 2019-08-07 23 1257 1308.571429 51.571429
282 2019-08-08 0 68 804.285714 736.285714
In [136]:
per_day_per_hour['when'] = per_day_per_hour['date'].dt.day.astype('str') + ', ' + per_day_per_hour['hour'].astype('str')

Let's plot a line that would how how the recieved values varied from average

In [137]:
fig = px.line(per_day_per_hour,x='when', y="diff", title='Difference between regular event number and recieved')
fig.show()

Conclusion

This dataset contains have 5 unique events and 243713 events in general in the logs dataset with 7551 having nearly 32 events per one. The data in the dataset describes 2 weeks, but not all the days contain properly recieved information, maybe due to technical reasons. I analysed the data and founf an average hamount of events for every hour to compare with the recieved data. I can see on the graph that starting from 2019-08-01 data looks 'normal' and I choose this date ti be the fist point of properly distributed data. When the data was close to the 'average' in July, in was due to regularly low numbers events that is proven by other plots. So the data really represents just the period from 2019-08-01 to 2019-08-08.

Did you lose many events and users when excluding the older data?

In [138]:
good_date = '2019-08-01'

filtered_logs = logs_exp[logs_exp['date'] >= good_date]

bad_data = logs_exp[logs_exp['date'] < good_date]

print(f'After getting rid of bad data (but recived during the half of the whole research time), we lost only lost {round(bad_data.shape[0] / logs_exp.shape[0] * 100,2)}% of data')

print(f'We also lost {bad_data.user_id.nunique()} users')
After getting rid of bad data (but recived during the half of the whole research time), we lost only lost 0.82% of data
We also lost 1319 users

I also want to see, what are the proportions of the remained events.

In [139]:
bad_data.event_name.value_counts(normalize=True) * 100
Out[139]:
MainScreenAppear           60.935143
CartScreenAppear           16.339869
OffersScreenAppear         13.926596
PaymentScreenSuccessful     8.396179
Tutorial                    0.402212
Name: event_name, dtype: float64

Make sure you have users from all three experimental groups.

In [140]:
users_in_group = pd.DataFrame(filtered_logs.groupby('experiment_id')['user_id'].nunique())

fig = px.pie(users_in_group,  values='user_id', names=users_in_group.index, title='Proportions experiment groups')
fig.show()
In [141]:
filtered_logs.groupby('experiment_id')['user_id'].nunique()
Out[141]:
experiment_id
246    2484
247    2517
248    2537
Name: user_id, dtype: int64
In [142]:
events_count = pd.DataFrame(filtered_logs.experiment_id.value_counts())
In [143]:
fig = px.histogram(events_count, x=events_count.index, y = 'experiment_id',
                   title='Number of events per group')
fig.show()

Conclusion

After getting rid of bad data (but recived during the half of the whole research time), we lost only lost 0.82% of data, that included visits of 1319 users. Also I checked the proportions of the event types and their order remained the same. Also I checked how many users are left in the filtered data, but they are still nicely distributed.

Step 4. The event funnel

Frequency of event occurrence

In [144]:
fig = px.pie(filtered_logs, values='user_id', names='event_name', title='Proportions of the events in the filtered dataset')
fig.show()
In [145]:
print('Amount of events:\n',filtered_logs.event_name.value_counts())
Amount of events:
 MainScreenAppear           117889
OffersScreenAppear          46531
CartScreenAppear            42343
PaymentScreenSuccessful     33951
Tutorial                     1010
Name: event_name, dtype: int64
In [146]:
filtered_logs.groupby('user_id')['event_name'].apply(lambda x: x.mode()).value_counts()
Out[146]:
MainScreenAppear           6035
OffersScreenAppear         1176
CartScreenAppear            741
PaymentScreenSuccessful     173
Tutorial                     29
Name: event_name, dtype: int64

Conclusion

We can see that MainScreenAppear is an undoubtle leader followed by OffersScreenAppear and CartScreenAppear that both have 2 time fewer users. It is also the most frequent event for 6035 users

Users who performed each action

In [147]:
amount_of_events = pd.DataFrame(filtered_logs.groupby('user_id')['event_name'].nunique().reset_index().groupby('event_name')['user_id'].nunique())
amount_of_events
Out[147]:
user_id
event_name
1 2717
2 1006
3 319
4 3027
5 469
In [148]:
fig = px.histogram(amount_of_events, x=amount_of_events.index, y = 'user_id', nbins=5,
                   title='Number of users per amount of events')
fig.show()
In [149]:
user_events = filtered_logs.groupby('user_id')['event_name'].nunique().reset_index(drop=False)

user_events.columns = ['user_id', 'number_of_events']
In [150]:
users_5_actions = list(user_events[user_events['number_of_events'] == 5]['user_id'])

users_4_actions = list(user_events[user_events['number_of_events'] == 4]['user_id'])

Conclusion

In [151]:
print(f'We can see that there is the highest amount of users who did only 1 action - presumably MainScreenAppear and ofthose who did 4 actions - all the neccasiare ones, but not the tutorial. So we have {len(users_4_actions)} - {round(len(users_4_actions) / filtered_logs.user_id.nunique() * 100,2)}% of all users -  who performed all the 5 actions except for Tutorial and {len(users_5_actions)} including it. ')
We can see that there is the highest amount of users who did only 1 action - presumably MainScreenAppear and ofthose who did 4 actions - all the neccasiare ones, but not the tutorial. So we have 3027 - 40.16% of all users -  who performed all the 5 actions except for Tutorial and 469 including it. 

Sort the events by the number of users

In [152]:
number_of_users = pd.DataFrame(filtered_logs.groupby('event_name')['user_id'].nunique().sort_values(ascending=False))
number_of_users
Out[152]:
user_id
event_name
MainScreenAppear 7423
OffersScreenAppear 4597
CartScreenAppear 3736
PaymentScreenSuccessful 3540
Tutorial 843
In [153]:
fig = px.histogram(number_of_users, x=number_of_users.index, y = 'user_id', nbins=5,
                   title='Number of users per event')
fig.show()

Conclusion

We can see that the number of users goes down on the PaymentScreenSuccessful section, but the amount of events as we saw above is still low. That mean that there areusers who make a lot of purchases

Proportion of users who performed the action at least once.

performed any action just once

In [154]:
users_1_action = list(user_events[user_events['number_of_events'] == 1]['user_id'])
In [155]:
print(f'We have {len(users_1_action)} users performed just one action, the MainScreenAppear presumably, its {round(len(users_1_action) / filtered_logs.user_id.nunique() * 100,2)}% of all users')
We have 2717 users performed just one action, the MainScreenAppear presumably, its 36.04% of all users
In [156]:
print(f' But those who performed the action at least one and more actions are {round((filtered_logs.user_id.nunique() - len(users_5_actions)) / filtered_logs.user_id.nunique() * 100,2)}%')
 But those who performed the action at least one and more actions are 93.78%

Order of actions

I beleive that tutorial has such low values because it is not the necessaire part of events, it's extra. So the order is as follows

In [157]:
order = ['MainScreenAppear', 'OffersScreenAppear', 'CartScreenAppear', 'PaymentScreenSuccessful']

Event funnel

In [158]:
funnel = filtered_logs.groupby(
    'event_name')['user_id'].nunique().sort_values(ascending=False).reset_index()

funnel['pct'] = funnel.user_id.pct_change()

funnel = funnel.drop(funnel[funnel.event_name =='Tutorial'].index)

funnel
Out[158]:
event_name user_id pct
0 MainScreenAppear 7423 NaN
1 OffersScreenAppear 4597 -0.380709
2 CartScreenAppear 3736 -0.187296
3 PaymentScreenSuccessful 3540 -0.052463
In [159]:
fig = px.funnel(funnel, x ='event_name', y = 'user_id', title='Total funnel')
fig.show()
In [160]:
funnel_list = []

for exp in filtered_logs.experiment_id.unique():
    print(exp)
    funnel_ = filtered_logs[filtered_logs.experiment_id == exp]
    funnel_ = pd.DataFrame(funnel_.groupby('event_name')['user_id'].nunique().sort_values( ascending=False).reset_index())
    funnel_['pct'] = funnel_.user_id.pct_change()
    funnel_['group'] = exp
    print(funnel_)
    funnel_list.append(funnel_)
247
                event_name  user_id       pct group
0         MainScreenAppear     2479       NaN   247
1       OffersScreenAppear     1524 -0.385236   247
2         CartScreenAppear     1239 -0.187008   247
3  PaymentScreenSuccessful     1158 -0.065375   247
4                 Tutorial      284 -0.754750   247
248
                event_name  user_id       pct group
0         MainScreenAppear     2494       NaN   248
1       OffersScreenAppear     1531 -0.386127   248
2         CartScreenAppear     1231 -0.195950   248
3  PaymentScreenSuccessful     1182 -0.039805   248
4                 Tutorial      281 -0.762267   248
246
                event_name  user_id       pct group
0         MainScreenAppear     2450       NaN   246
1       OffersScreenAppear     1542 -0.370612   246
2         CartScreenAppear     1266 -0.178988   246
3  PaymentScreenSuccessful     1200 -0.052133   246
4                 Tutorial      278 -0.768333   246

Conclusion

OffersScreenAppear has almost 38% lower events than MainScreenAppear in general and for every group. But further values varies a little

In [161]:
total_funnel = pd.concat(funnel_list, axis=0)
In [162]:
fig = px.funnel(total_funnel, x ='event_name', y = 'user_id', color='group')
fig.show()

Stage with highest lose rate

In [163]:
funnel[funnel.pct == funnel.pct.min()].event_name
Out[163]:
1    OffersScreenAppear
Name: event_name, dtype: category
Categories (5, object): ['CartScreenAppear', 'MainScreenAppear', 'OffersScreenAppear', 'PaymentScreenSuccessful', 'Tutorial']

Share of users who make the entire journey from their first event to payment

In [164]:
every_event_per_user = filtered_logs.pivot_table(index='user_id', columns='event_name', values='timestamp', aggfunc='min')

every_event_per_user = every_event_per_user.drop('Tutorial', axis=1)

every_event_per_user = every_event_per_user.dropna(how='any')
In [165]:
print(f'We have {every_event_per_user.shape[0]} who had this whole journey. it is {round(every_event_per_user.shape[0] / filtered_logs.user_id.nunique() * 100,2)}% of all users')
We have 3430 who had this whole journey. it is 45.5% of all users

Conclusion

In this step we found the leader event: MainScreenAppear. The biggest gap happends between MainScreenAppear and OffersScreenAppear. interesting is that the number of users goes down on the PaymentScreenSuccessful section, but the amount of events as we saw above is still low. That mean that there areusers who make a lot of purchases. Also see that there is the highest amount of users with 1 action and with 4 actions. 3027 - 40.16% of all users performed all the 5 actions except for Tutorial and 469 including it. We decided to establish the order:'MainScreenAppear', 'OffersScreenAppear', 'CartScreenAppear', 'PaymentScreenSuccessful' and found that is was followed by almost a half or our visitors - 45.5%. That's good We really need to check why half people find nothing interesting on the MainScreenAppear page and just go away. Maybe we can drug their attention with customized products they would like, maybe show them good discounts, or maybe make them interact by other way: possibly put a little 2-d game where a user who get's to the end, can choose discount for any types of products. This can drag people's attnetion.

Step 5. Study the results of the experiment

Amount of users in each group

In [166]:
users_per_experiment = pd.DataFrame(filtered_logs.groupby('experiment_id')['user_id'].nunique())
In [167]:
fig = px.histogram(users_per_experiment, x=users_per_experiment.index, y = 'user_id', nbins=3,
                   title='Number of users per experiment')
fig.show()

It is still almost equal

Statistically significant difference

Let's see how many users we have per evety eent per every group

In [168]:
pivot = filtered_logs.pivot_table(values='user_id', index='event_name', columns='experiment_id',aggfunc=lambda x: x.nunique())

pivot = pivot.sort_values(by='246', ascending=False)
pivot
Out[168]:
experiment_id 246 247 248
event_name
MainScreenAppear 2450 2479 2494
OffersScreenAppear 1542 1524 1531
CartScreenAppear 1266 1239 1231
PaymentScreenSuccessful 1200 1158 1182
Tutorial 278 284 281

Statistically significant difference between samples 246 and 247.

I have a H0 hypothesis that there is no statistically significant difference between 246 and 247 groups. Aternative hypothesis is that there is one.

In [169]:
def statistical_difference(group1, group2, alpha):
    alpha = alpha
    trials1 = filtered_logs[filtered_logs.experiment_id==group1].user_id.nunique()
    trials2 = filtered_logs[filtered_logs.experiment_id==group2].user_id.nunique()
    for event in list(filtered_logs.event_name.unique()):
        success1 = pivot.loc[event, group1]
        success2 = pivot.loc[event, group2]
        p1 = success1/trials1
        p2 = success2/trials2
        difference = p1 - p2
        p_combined = (success1 + success2) / (trials1 + trials2)
        z_value = difference / mth.sqrt(p_combined * (1 - p_combined) * (1/trials1 + 1/trials2))
        distr = st.norm(0, 1)
        p_value = (1 - distr.cdf(abs(z_value))) * 2
        print(f'Succes for {group1} is {success1}, for event: {event} out of {trials1} trials\n')
        print(f'Succes for {group2} is {success2}, for event: {event} out of {trials2} trials\n')

        print(f'p-value for {event}: ', p_value)
        if (p_value < alpha):
            print(f"Rejecting the null hypothesis for {group1} and {group2} on event {event}: there is a significant difference between the proportions\n\n")
        else:
            print(f"Failed to reject the null hypothesis for {group1} and {group2} on event {event}: there is no reason to consider the proportions different\n\n")
In [170]:
alpha = 0.05

statistical_difference('246', '247', alpha)
Succes for 246 is 2450, for event: MainScreenAppear out of 2484 trials

Succes for 247 is 2479, for event: MainScreenAppear out of 2517 trials

p-value for MainScreenAppear:  0.6756217702005545
Failed to reject the null hypothesis for 246 and 247 on event MainScreenAppear: there is no reason to consider the proportions different


Succes for 246 is 1542, for event: OffersScreenAppear out of 2484 trials

Succes for 247 is 1524, for event: OffersScreenAppear out of 2517 trials

p-value for OffersScreenAppear:  0.26698769175859516
Failed to reject the null hypothesis for 246 and 247 on event OffersScreenAppear: there is no reason to consider the proportions different


Succes for 246 is 1200, for event: PaymentScreenSuccessful out of 2484 trials

Succes for 247 is 1158, for event: PaymentScreenSuccessful out of 2517 trials

p-value for PaymentScreenSuccessful:  0.10298394982948822
Failed to reject the null hypothesis for 246 and 247 on event PaymentScreenSuccessful: there is no reason to consider the proportions different


Succes for 246 is 1266, for event: CartScreenAppear out of 2484 trials

Succes for 247 is 1239, for event: CartScreenAppear out of 2517 trials

p-value for CartScreenAppear:  0.2182812140633792
Failed to reject the null hypothesis for 246 and 247 on event CartScreenAppear: there is no reason to consider the proportions different


Succes for 246 is 278, for event: Tutorial out of 2484 trials

Succes for 247 is 284, for event: Tutorial out of 2517 trials

p-value for Tutorial:  0.9182790262812368
Failed to reject the null hypothesis for 246 and 247 on event Tutorial: there is no reason to consider the proportions different


Coclusion

We found no signinficant difference between groups 246 and 247

Statistically significant difference between all samples

I have a H0 hypothesis that there is no statistically significant difference between groups that I check. Aternative hypothesis is that there is one.

In [171]:
for i in list(itertools.combinations(filtered_logs.experiment_id.unique(),2)):
    print(f'{i} test: \n')
    statistical_difference(i[0], i[1], alpha)
    print('*'*90)
('247', '248') test: 

Succes for 247 is 2479, for event: MainScreenAppear out of 2517 trials

Succes for 248 is 2494, for event: MainScreenAppear out of 2537 trials

p-value for MainScreenAppear:  0.6001661582453706
Failed to reject the null hypothesis for 247 and 248 on event MainScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 1524, for event: OffersScreenAppear out of 2517 trials

Succes for 248 is 1531, for event: OffersScreenAppear out of 2537 trials

p-value for OffersScreenAppear:  0.8835956656016957
Failed to reject the null hypothesis for 247 and 248 on event OffersScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 1158, for event: PaymentScreenSuccessful out of 2517 trials

Succes for 248 is 1182, for event: PaymentScreenSuccessful out of 2537 trials

p-value for PaymentScreenSuccessful:  0.6775413642906454
Failed to reject the null hypothesis for 247 and 248 on event PaymentScreenSuccessful: there is no reason to consider the proportions different


Succes for 247 is 1239, for event: CartScreenAppear out of 2517 trials

Succes for 248 is 1231, for event: CartScreenAppear out of 2537 trials

p-value for CartScreenAppear:  0.6169517476996997
Failed to reject the null hypothesis for 247 and 248 on event CartScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 284, for event: Tutorial out of 2517 trials

Succes for 248 is 281, for event: Tutorial out of 2537 trials

p-value for Tutorial:  0.8151967015119994
Failed to reject the null hypothesis for 247 and 248 on event Tutorial: there is no reason to consider the proportions different


******************************************************************************************
('247', '246') test: 

Succes for 247 is 2479, for event: MainScreenAppear out of 2517 trials

Succes for 246 is 2450, for event: MainScreenAppear out of 2484 trials

p-value for MainScreenAppear:  0.6756217702005545
Failed to reject the null hypothesis for 247 and 246 on event MainScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 1524, for event: OffersScreenAppear out of 2517 trials

Succes for 246 is 1542, for event: OffersScreenAppear out of 2484 trials

p-value for OffersScreenAppear:  0.26698769175859516
Failed to reject the null hypothesis for 247 and 246 on event OffersScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 1158, for event: PaymentScreenSuccessful out of 2517 trials

Succes for 246 is 1200, for event: PaymentScreenSuccessful out of 2484 trials

p-value for PaymentScreenSuccessful:  0.10298394982948822
Failed to reject the null hypothesis for 247 and 246 on event PaymentScreenSuccessful: there is no reason to consider the proportions different


Succes for 247 is 1239, for event: CartScreenAppear out of 2517 trials

Succes for 246 is 1266, for event: CartScreenAppear out of 2484 trials

p-value for CartScreenAppear:  0.2182812140633792
Failed to reject the null hypothesis for 247 and 246 on event CartScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 284, for event: Tutorial out of 2517 trials

Succes for 246 is 278, for event: Tutorial out of 2484 trials

p-value for Tutorial:  0.9182790262812368
Failed to reject the null hypothesis for 247 and 246 on event Tutorial: there is no reason to consider the proportions different


******************************************************************************************
('248', '246') test: 

Succes for 248 is 2494, for event: MainScreenAppear out of 2537 trials

Succes for 246 is 2450, for event: MainScreenAppear out of 2484 trials

p-value for MainScreenAppear:  0.34705881021236484
Failed to reject the null hypothesis for 248 and 246 on event MainScreenAppear: there is no reason to consider the proportions different


Succes for 248 is 1531, for event: OffersScreenAppear out of 2537 trials

Succes for 246 is 1542, for event: OffersScreenAppear out of 2484 trials

p-value for OffersScreenAppear:  0.20836205402738917
Failed to reject the null hypothesis for 248 and 246 on event OffersScreenAppear: there is no reason to consider the proportions different


Succes for 248 is 1182, for event: PaymentScreenSuccessful out of 2537 trials

Succes for 246 is 1200, for event: PaymentScreenSuccessful out of 2484 trials

p-value for PaymentScreenSuccessful:  0.22269358994682742
Failed to reject the null hypothesis for 248 and 246 on event PaymentScreenSuccessful: there is no reason to consider the proportions different


Succes for 248 is 1231, for event: CartScreenAppear out of 2537 trials

Succes for 246 is 1266, for event: CartScreenAppear out of 2484 trials

p-value for CartScreenAppear:  0.08328412977507749
Failed to reject the null hypothesis for 248 and 246 on event CartScreenAppear: there is no reason to consider the proportions different


Succes for 248 is 281, for event: Tutorial out of 2537 trials

Succes for 246 is 278, for event: Tutorial out of 2484 trials

p-value for Tutorial:  0.8964489622133207
Failed to reject the null hypothesis for 248 and 246 on event Tutorial: there is no reason to consider the proportions different


******************************************************************************************

Coclusion

And we found no statistical difference what so ever

In [172]:
funnel[funnel.user_id == funnel.user_id.max()].event_name
Out[172]:
0    MainScreenAppear
Name: event_name, dtype: category
Categories (5, object): ['CartScreenAppear', 'MainScreenAppear', 'OffersScreenAppear', 'PaymentScreenSuccessful', 'Tutorial']
In [173]:
def find_share(group1, group2, event):
    print(
        f'share of users from {group1}: {round(pivot.loc[event, group1]/pivot.loc[event, :].sum() * 100,2)}%. ')
    print(
        f'share of users from {group2}: {round(pivot.loc[event, group2]/pivot.loc[event, :].sum() * 100,2)}%. ')
    print()
In [174]:
for i in list(itertools.combinations(filtered_logs.experiment_id.unique(),2)):
    print(f'Share of MainScreenAppear event for groups {i}')
    find_share(i[0],i[1], 'MainScreenAppear')
Share of MainScreenAppear event for groups ('247', '248')
share of users from 247: 33.4%. 
share of users from 248: 33.6%. 

Share of MainScreenAppear event for groups ('247', '246')
share of users from 247: 33.4%. 
share of users from 246: 33.01%. 

Share of MainScreenAppear event for groups ('248', '246')
share of users from 248: 33.6%. 
share of users from 246: 33.01%. 

Conclusion

The number of users looks properly distributed and have equal proportions

The share of the number of users on every stage

In [175]:
for i in list(itertools.combinations(filtered_logs.experiment_id.unique(),2)):
    for event in list(filtered_logs.event_name.unique()):
        print(f'Share of {event} event for groups {i}')
        find_share(i[0],i[1], event)
Share of MainScreenAppear event for groups ('247', '248')
share of users from 247: 33.4%. 
share of users from 248: 33.6%. 

Share of OffersScreenAppear event for groups ('247', '248')
share of users from 247: 33.15%. 
share of users from 248: 33.3%. 

Share of PaymentScreenSuccessful event for groups ('247', '248')
share of users from 247: 32.71%. 
share of users from 248: 33.39%. 

Share of CartScreenAppear event for groups ('247', '248')
share of users from 247: 33.16%. 
share of users from 248: 32.95%. 

Share of Tutorial event for groups ('247', '248')
share of users from 247: 33.69%. 
share of users from 248: 33.33%. 

Share of MainScreenAppear event for groups ('247', '246')
share of users from 247: 33.4%. 
share of users from 246: 33.01%. 

Share of OffersScreenAppear event for groups ('247', '246')
share of users from 247: 33.15%. 
share of users from 246: 33.54%. 

Share of PaymentScreenSuccessful event for groups ('247', '246')
share of users from 247: 32.71%. 
share of users from 246: 33.9%. 

Share of CartScreenAppear event for groups ('247', '246')
share of users from 247: 33.16%. 
share of users from 246: 33.89%. 

Share of Tutorial event for groups ('247', '246')
share of users from 247: 33.69%. 
share of users from 246: 32.98%. 

Share of MainScreenAppear event for groups ('248', '246')
share of users from 248: 33.6%. 
share of users from 246: 33.01%. 

Share of OffersScreenAppear event for groups ('248', '246')
share of users from 248: 33.3%. 
share of users from 246: 33.54%. 

Share of PaymentScreenSuccessful event for groups ('248', '246')
share of users from 248: 33.39%. 
share of users from 246: 33.9%. 

Share of CartScreenAppear event for groups ('248', '246')
share of users from 248: 32.95%. 
share of users from 246: 33.89%. 

Share of Tutorial event for groups ('248', '246')
share of users from 248: 33.33%. 
share of users from 246: 32.98%. 

Conclusion

And this is true for every stage and every group. No significat difference.

Significance level of the statistical hypotheses

Calculate how many statistical hypothesis tests you carried out.

In [176]:
number_of_tests = filtered_logs.event_name.nunique() * filtered_logs.experiment_id.nunique()

Since I really have a lot of tests I need to be sure and needd to adjust the value of alpha for it. I checked the Bonferroni method and got to the same conclusions (see below), but since we should really care about all the error types I also used the Šidák approach, because it offers higher power and got some new answer

Bonferroni

In [177]:
bonferroni = alpha / number_of_tests

for i in list(itertools.combinations(filtered_logs.experiment_id.unique(),2)):
    print(f'{i} test: \n')
    statistical_difference(i[0], i[1], bonferroni)
    print('*'*90)
('247', '248') test: 

Succes for 247 is 2479, for event: MainScreenAppear out of 2517 trials

Succes for 248 is 2494, for event: MainScreenAppear out of 2537 trials

p-value for MainScreenAppear:  0.6001661582453706
Failed to reject the null hypothesis for 247 and 248 on event MainScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 1524, for event: OffersScreenAppear out of 2517 trials

Succes for 248 is 1531, for event: OffersScreenAppear out of 2537 trials

p-value for OffersScreenAppear:  0.8835956656016957
Failed to reject the null hypothesis for 247 and 248 on event OffersScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 1158, for event: PaymentScreenSuccessful out of 2517 trials

Succes for 248 is 1182, for event: PaymentScreenSuccessful out of 2537 trials

p-value for PaymentScreenSuccessful:  0.6775413642906454
Failed to reject the null hypothesis for 247 and 248 on event PaymentScreenSuccessful: there is no reason to consider the proportions different


Succes for 247 is 1239, for event: CartScreenAppear out of 2517 trials

Succes for 248 is 1231, for event: CartScreenAppear out of 2537 trials

p-value for CartScreenAppear:  0.6169517476996997
Failed to reject the null hypothesis for 247 and 248 on event CartScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 284, for event: Tutorial out of 2517 trials

Succes for 248 is 281, for event: Tutorial out of 2537 trials

p-value for Tutorial:  0.8151967015119994
Failed to reject the null hypothesis for 247 and 248 on event Tutorial: there is no reason to consider the proportions different


******************************************************************************************
('247', '246') test: 

Succes for 247 is 2479, for event: MainScreenAppear out of 2517 trials

Succes for 246 is 2450, for event: MainScreenAppear out of 2484 trials

p-value for MainScreenAppear:  0.6756217702005545
Failed to reject the null hypothesis for 247 and 246 on event MainScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 1524, for event: OffersScreenAppear out of 2517 trials

Succes for 246 is 1542, for event: OffersScreenAppear out of 2484 trials

p-value for OffersScreenAppear:  0.26698769175859516
Failed to reject the null hypothesis for 247 and 246 on event OffersScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 1158, for event: PaymentScreenSuccessful out of 2517 trials

Succes for 246 is 1200, for event: PaymentScreenSuccessful out of 2484 trials

p-value for PaymentScreenSuccessful:  0.10298394982948822
Failed to reject the null hypothesis for 247 and 246 on event PaymentScreenSuccessful: there is no reason to consider the proportions different


Succes for 247 is 1239, for event: CartScreenAppear out of 2517 trials

Succes for 246 is 1266, for event: CartScreenAppear out of 2484 trials

p-value for CartScreenAppear:  0.2182812140633792
Failed to reject the null hypothesis for 247 and 246 on event CartScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 284, for event: Tutorial out of 2517 trials

Succes for 246 is 278, for event: Tutorial out of 2484 trials

p-value for Tutorial:  0.9182790262812368
Failed to reject the null hypothesis for 247 and 246 on event Tutorial: there is no reason to consider the proportions different


******************************************************************************************
('248', '246') test: 

Succes for 248 is 2494, for event: MainScreenAppear out of 2537 trials

Succes for 246 is 2450, for event: MainScreenAppear out of 2484 trials

p-value for MainScreenAppear:  0.34705881021236484
Failed to reject the null hypothesis for 248 and 246 on event MainScreenAppear: there is no reason to consider the proportions different


Succes for 248 is 1531, for event: OffersScreenAppear out of 2537 trials

Succes for 246 is 1542, for event: OffersScreenAppear out of 2484 trials

p-value for OffersScreenAppear:  0.20836205402738917
Failed to reject the null hypothesis for 248 and 246 on event OffersScreenAppear: there is no reason to consider the proportions different


Succes for 248 is 1182, for event: PaymentScreenSuccessful out of 2537 trials

Succes for 246 is 1200, for event: PaymentScreenSuccessful out of 2484 trials

p-value for PaymentScreenSuccessful:  0.22269358994682742
Failed to reject the null hypothesis for 248 and 246 on event PaymentScreenSuccessful: there is no reason to consider the proportions different


Succes for 248 is 1231, for event: CartScreenAppear out of 2537 trials

Succes for 246 is 1266, for event: CartScreenAppear out of 2484 trials

p-value for CartScreenAppear:  0.08328412977507749
Failed to reject the null hypothesis for 248 and 246 on event CartScreenAppear: there is no reason to consider the proportions different


Succes for 248 is 281, for event: Tutorial out of 2537 trials

Succes for 246 is 278, for event: Tutorial out of 2484 trials

p-value for Tutorial:  0.8964489622133207
Failed to reject the null hypothesis for 248 and 246 on event Tutorial: there is no reason to consider the proportions different


******************************************************************************************

Nothing new here

Šidák

In [178]:
Šidák = 1 - pow((1-alpha),number_of_tests)
In [179]:
for i in list(itertools.combinations(filtered_logs.experiment_id.unique(),2)):
    print(f'{i} test: \n')
    statistical_difference(i[0], i[1], Šidák)
    print('*'*90)
('247', '248') test: 

Succes for 247 is 2479, for event: MainScreenAppear out of 2517 trials

Succes for 248 is 2494, for event: MainScreenAppear out of 2537 trials

p-value for MainScreenAppear:  0.6001661582453706
Failed to reject the null hypothesis for 247 and 248 on event MainScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 1524, for event: OffersScreenAppear out of 2517 trials

Succes for 248 is 1531, for event: OffersScreenAppear out of 2537 trials

p-value for OffersScreenAppear:  0.8835956656016957
Failed to reject the null hypothesis for 247 and 248 on event OffersScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 1158, for event: PaymentScreenSuccessful out of 2517 trials

Succes for 248 is 1182, for event: PaymentScreenSuccessful out of 2537 trials

p-value for PaymentScreenSuccessful:  0.6775413642906454
Failed to reject the null hypothesis for 247 and 248 on event PaymentScreenSuccessful: there is no reason to consider the proportions different


Succes for 247 is 1239, for event: CartScreenAppear out of 2517 trials

Succes for 248 is 1231, for event: CartScreenAppear out of 2537 trials

p-value for CartScreenAppear:  0.6169517476996997
Failed to reject the null hypothesis for 247 and 248 on event CartScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 284, for event: Tutorial out of 2517 trials

Succes for 248 is 281, for event: Tutorial out of 2537 trials

p-value for Tutorial:  0.8151967015119994
Failed to reject the null hypothesis for 247 and 248 on event Tutorial: there is no reason to consider the proportions different


******************************************************************************************
('247', '246') test: 

Succes for 247 is 2479, for event: MainScreenAppear out of 2517 trials

Succes for 246 is 2450, for event: MainScreenAppear out of 2484 trials

p-value for MainScreenAppear:  0.6756217702005545
Failed to reject the null hypothesis for 247 and 246 on event MainScreenAppear: there is no reason to consider the proportions different


Succes for 247 is 1524, for event: OffersScreenAppear out of 2517 trials

Succes for 246 is 1542, for event: OffersScreenAppear out of 2484 trials

p-value for OffersScreenAppear:  0.26698769175859516
Rejecting the null hypothesis for 247 and 246 on event OffersScreenAppear: there is a significant difference between the proportions


Succes for 247 is 1158, for event: PaymentScreenSuccessful out of 2517 trials

Succes for 246 is 1200, for event: PaymentScreenSuccessful out of 2484 trials

p-value for PaymentScreenSuccessful:  0.10298394982948822
Rejecting the null hypothesis for 247 and 246 on event PaymentScreenSuccessful: there is a significant difference between the proportions


Succes for 247 is 1239, for event: CartScreenAppear out of 2517 trials

Succes for 246 is 1266, for event: CartScreenAppear out of 2484 trials

p-value for CartScreenAppear:  0.2182812140633792
Rejecting the null hypothesis for 247 and 246 on event CartScreenAppear: there is a significant difference between the proportions


Succes for 247 is 284, for event: Tutorial out of 2517 trials

Succes for 246 is 278, for event: Tutorial out of 2484 trials

p-value for Tutorial:  0.9182790262812368
Failed to reject the null hypothesis for 247 and 246 on event Tutorial: there is no reason to consider the proportions different


******************************************************************************************
('248', '246') test: 

Succes for 248 is 2494, for event: MainScreenAppear out of 2537 trials

Succes for 246 is 2450, for event: MainScreenAppear out of 2484 trials

p-value for MainScreenAppear:  0.34705881021236484
Rejecting the null hypothesis for 248 and 246 on event MainScreenAppear: there is a significant difference between the proportions


Succes for 248 is 1531, for event: OffersScreenAppear out of 2537 trials

Succes for 246 is 1542, for event: OffersScreenAppear out of 2484 trials

p-value for OffersScreenAppear:  0.20836205402738917
Rejecting the null hypothesis for 248 and 246 on event OffersScreenAppear: there is a significant difference between the proportions


Succes for 248 is 1182, for event: PaymentScreenSuccessful out of 2537 trials

Succes for 246 is 1200, for event: PaymentScreenSuccessful out of 2484 trials

p-value for PaymentScreenSuccessful:  0.22269358994682742
Rejecting the null hypothesis for 248 and 246 on event PaymentScreenSuccessful: there is a significant difference between the proportions


Succes for 248 is 1231, for event: CartScreenAppear out of 2537 trials

Succes for 246 is 1266, for event: CartScreenAppear out of 2484 trials

p-value for CartScreenAppear:  0.08328412977507749
Rejecting the null hypothesis for 248 and 246 on event CartScreenAppear: there is a significant difference between the proportions


Succes for 248 is 281, for event: Tutorial out of 2537 trials

Succes for 246 is 278, for event: Tutorial out of 2484 trials

p-value for Tutorial:  0.8964489622133207
Failed to reject the null hypothesis for 248 and 246 on event Tutorial: there is no reason to consider the proportions different


******************************************************************************************

We found a statistically significant difference between 247 and 246 and between 248 and 246 on the events: OffersScreenAppear, PaymentScreenSuccessful, CartScreenAppear. But that's weird because 246 and 247 are the control groups, and 248 is the test group

General Conclusion

This dataset have 5 unique events and 243713 events in general with 7551 user having nearly 32 events per one. Even though the data containes information on 2 weeks period, we can consider only the second week worth - since 01-08, because prior to it the data was incomplete. I found that the amount of events is naturally higher at day and naturally lower at night. After getting rid of the fist week data we lost only lost 0.82% of data, that included visits of 1319 users.
Thean I found the leader event: MainScreenAppear and establish customer journey: 'MainScreenAppear', 'OffersScreenAppear', 'CartScreenAppear', 'PaymentScreenSuccessful'. It was followed by almost a half or our visitors - 45.5%. Funnel showed me that the biggest procentage of user loss happens between MainScreenAppear and OffersScreenAppear, and we should pay more attention to it and to find te reasons why. Also interesting is that the number of users goes down on the PaymentScreenSuccessful section, but the amount of events as we saw above is still low. That mean that there are users who make a lot of purchases, our loyal customers. Never the less we have its 36.04% of all users who performed just one action, but 93.78%, who performed it all. We really need to check why half of users finds nothing interesting on the MainScreenAppear page and just goes away. Maybe we can drug their attention with customized products they would like, maybe show them good discounts, or maybe make them interact by other way: possibly put a little 2-d game where a user who get's to the end, can choose discount for any types of products.


I found no difference in proportions for all the groups. Also using both Šidák and Bonferroni approach I found no statistically significant diggerence between the test group and control groups. So we can say that the test failed to bring us more conversion.